System Performance Analysis Summary

 

vmstat 2 10: samples every 2 seconds, 10 times; lets you look for performance problems that are CPU-, memory-, or I/O-bound

 

CPU BOUND

cpu

us: % of CPU time spent in user mode

sy: % of CPU time that was spent executing a process in system mode

us + sy < 80%: if this exceeds 80%, processes may spend time in a run queue

id: % of CPU time spent idle without pending local disk I/O

wa: % of CPU time spent idle with pending local disk I/O

wa < 25%: if this exceeds 25%, the disk subsystem may be improperly balanced, or may be the result of a disk-intensive workload

 

kthr

r: average # of kernel threads placed in the run queue per second

r < 5: if this increases rapidly, it may be the application(s), although some systems may run fine with 10-15 threads on their run queue, depending on the tasks and the amount of time they run

b: average number of kernel threads placed on the wait queue per second

 

faults

sy: number of system calls per second

sy < 10000: if this exceeds 10000, it may indicate a problem, although applications vary widely - there should be a baseline measurement for a normal sy value
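The CPU-bound thresholds above can be collected into one check. This is a minimal sketch in Python, assuming the values have already been read from a vmstat line; the function name and parameter names are hypothetical, and the limits are only the rules of thumb listed above, not fixed rules.

```python
def cpu_warnings(us, sy, wa, r, syscalls):
    """Return rule-of-thumb warnings for one vmstat sample.

    us, sy, wa: the cpu columns (percent); r: the kthr run-queue column;
    syscalls: the faults 'sy' column (system calls per second).
    """
    warnings = []
    if us + sy > 80:
        warnings.append("us + sy above 80%: processes may wait in the run queue")
    if wa > 25:
        warnings.append("wa above 25%: disks unbalanced or disk-intensive workload")
    if r >= 5:
        warnings.append("r not below 5: compare with the normal run-queue length")
    if syscalls > 10000:
        warnings.append("sy above 10000 calls/s: compare with the baseline")
    return warnings


if __name__ == "__main__":
    for line in cpu_warnings(us=70, sy=20, wa=30, r=8, syscalls=20000):
        print(line)
```

Because acceptable r and sy values vary by system, the last two warnings are best read against a baseline taken when the system is running well.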

 

MEMORY BOUND

memory

avm / 256 = roughly the number of MB allocated to paging space system-wide (avm is reported in 4 KB pages, so 256 pages = 1 MB)

fre: average # of free memory pages

minfree: for systems with > 64 MB, the default value of minfree is 120 pages. For systems with < 64 MB, the value is 2 times the number of MB of real memory, minus 8.

fre > minfree: if fre drops below minfree, the VMM will steal pages until the free list is restored to maxfree (which is minfree + 8).

If a system is experiencing 'thrashing', the fre value will be small.
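As a worked example of the minfree and maxfree arithmetic above, here is a small Python sketch. The function names are hypothetical, and the values are only the defaults described in these notes; a tuned system may use different settings.

```python
def default_minfree(real_memory_mb):
    """minfree in pages: 120 above 64 MB, otherwise 2 * MB - 8."""
    if real_memory_mb > 64:
        return 120
    return 2 * real_memory_mb - 8


def default_maxfree(real_memory_mb):
    """maxfree is minfree + 8 pages."""
    return default_minfree(real_memory_mb) + 8


def vmm_steals_pages(fre, real_memory_mb):
    """True when the free list (fre) has dropped below minfree."""
    return fre < default_minfree(real_memory_mb)


if __name__ == "__main__":
    print(default_minfree(32), default_maxfree(32))
    print(vmm_steals_pages(fre=100, real_memory_mb=128))
```

Note that the two branches agree at exactly 64 MB, since 2 * 64 - 8 is also 120.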

 

page

re: the number of page reclaims per second. A reclaim occurs when a page fault refers to a page that is currently on the free list but has not yet been reassigned (reclaiming is no longer supported in AIX V4)

pi: details the number of pages paged in from paging space

pi < 5 per second: this is not absolute, but 5 should generally be the upper limit

po: the number of pages paged out to paging space.

an increase in po without a corresponding increase in pi may indicate thrashing or problems with data access patterns of applications

thrashing: po / fr > 1/h : in AIX V4, the default value for h is 6 for systems with < 128 MB of RAM and 0 for systems with RAM >= 128 MB

the system slows down when pi and po are consistently non-zero
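The thrashing test po / fr > 1/h can be written without dividing, which avoids trouble when fr or h is 0. This Python sketch uses a hypothetical function name, and it assumes that h = 0 means the check is switched off, which is how the default for larger systems is treated here.

```python
def is_thrashing(po, fr, h):
    """Apply po / fr > 1/h, rewritten as po * h > fr.

    po: pages paged out per second; fr: pages freed per second;
    h: the threshold parameter (assumed: 0 disables the check).
    """
    if h == 0:
        return False
    return po * h > fr


if __name__ == "__main__":
    print(is_thrashing(po=10, fr=30, h=6))  # 10/30 compared with 1/6
    print(is_thrashing(po=2, fr=30, h=6))   # 2/30 compared with 1/6
```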

fr: number of pages per second that were freed by the page replacement algorithm

sr: number of pages per second that were examined by the page replacement algorithm

when sr is much larger than fr (that is, many pages must be examined for each page freed), memory is over-committed

example: an fr:sr ratio of 1:4 means that four pages had to be examined to free one page. It's a good idea to record a baseline for this ratio while the system is running fine.
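The fr and sr comparison reduces to "pages examined per page freed". A minimal Python sketch follows; the function names and the idea of passing in a stored baseline value are assumptions for illustration.

```python
def pages_scanned_per_freed(fr, sr):
    """Return sr / fr, or None when no pages were freed in the interval."""
    if fr == 0:
        return None
    return sr / fr


def worse_than_baseline(fr, sr, baseline):
    """True when more pages are scanned per freed page than in the baseline."""
    current = pages_scanned_per_freed(fr, sr)
    return current is not None and current > baseline


if __name__ == "__main__":
    # fr:sr of 1:4, i.e. four pages examined for each page freed
    print(pages_scanned_per_freed(fr=100, sr=400))
```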

cy: number of cycles per second of the clock algorithm

paging space calculation: For systems up to 256 MB of memory, paging space should be twice the size of real memory. For systems larger than 256 MB of memory, the recommended paging space is:

512 + (memory size - 256 MB) * 1.25

The paging space size cannot be less than 16 MB or greater than 20% of total disk space.
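The paging space rule can be worked through in a few lines of Python. This is a sketch with a hypothetical function name; it assumes all sizes are in MB, and that the 20% disk cap is applied only when the total disk size is supplied.

```python
def recommended_paging_space_mb(memory_mb, total_disk_mb=None):
    """Recommended paging space in MB, following the rule above."""
    if memory_mb <= 256:
        size = 2 * memory_mb
    else:
        size = 512 + (memory_mb - 256) * 1.25
    size = max(size, 16)                      # never less than 16 MB
    if total_disk_mb is not None:
        size = min(size, total_disk_mb / 5)   # never more than 20% of disk
    return size


if __name__ == "__main__":
    for mem in (128, 256, 1024):
        print(mem, recommended_paging_space_mb(mem))
```

For 1024 MB of memory this gives 512 + 768 * 1.25 = 1472 MB before any disk cap is applied.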

I/O BOUND

 

vmstat hdisk0 hdisk1 1 8

disk xfer: this shows the number of transfers per second to the specified physical volumes (up to 4 physical volumes)

The statistics are given for each drive in the order they were specified.

vmstat -i: This shows the number of interrupts taken by each device since system startup. With interval and count parameters, each subsequent stanza reports statistics for the preceding interval. High numbers in the count column may indicate a problem.

vmstat -s: This reports absolute counts of various events since the system was booted. Run it before and after a workload, then take the difference between the two outputs.

page ins and page outs: virtual memory activity to page-in or page-out pages from page space and file space.

paging space page ins and paging space page outs: paging space only

page ins - paging space ins = the number of pages that were read from persistent storage

page outs - paging space outs = the number of persistent pages that were written to disk
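The before-and-after subtraction for vmstat -s can be sketched as follows. The Python below assumes the four counters have already been extracted into dictionaries; the key names and the function name are hypothetical labels, not guaranteed to match the exact output text.

```python
COUNTERS = (
    "page ins",
    "page outs",
    "paging space page ins",
    "paging space page outs",
)


def persistent_io(before, after):
    """Pages read from / written to persistent storage during a workload."""
    delta = {name: after[name] - before[name] for name in COUNTERS}
    return {
        "persistent page ins": delta["page ins"] - delta["paging space page ins"],
        "persistent page outs": delta["page outs"] - delta["paging space page outs"],
    }


if __name__ == "__main__":
    before = {"page ins": 1000, "page outs": 500,
              "paging space page ins": 200, "paging space page outs": 100}
    after = {"page ins": 1600, "page outs": 900,
             "paging space page ins": 250, "paging space page outs": 300}
    print(persistent_io(before, after))
```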

If paging is excessive, vmtune may help. Creating separate paging spaces on separate volumes may also help, but adding real memory would definitely help.

 

iostat: fastest way to see if there is an I/O bound performance problem. If used with an interval and count parameter, the first stanza should be ignored because it is an average since system boot.

 

iostat -t 2 10

tin: total characters per second read by all tty devices

tout: total characters per second written to all tty devices

Look for a correlation between increased tty activity and CPU utilization. Options may be to modify port parameters or upgrade the async adaptor.

 

% user: % of CPU time spent in user mode

% sys: % of CPU time that was spent executing a process in system mode

% idle: % of CPU time spent idle without pending local disk I/O

% iowait: % of CPU time spent idle with pending local disk I/O. A high # usually indicates the system has a memory shortage or inefficient I/O subsystem configuration.

Typical solutions to balance I/O subsystems:

- create multiple Journaled File System (JFS) logs for a volume group and assign them to specific file systems

- backup and restore file systems to reduce fragmentation

- add additional drives and rebalance the existing I/O subsystem

If iostat indicates no CPU-bound situation and % iowait > 25%, there is an I/O or disk-bound situation. This may be caused by excessive paging due to a lack of real memory. It could also be due to unbalanced disk load, fragmented data, or usage patterns.
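The decision above can be expressed as a small classifier. In this Python sketch the function name is hypothetical, and "CPU-bound" is assumed to mean % user + % sys above 80%, borrowing the vmstat rule of thumb from earlier in these notes.

```python
def classify_iostat(user, system, iowait):
    """Classify one iostat CPU sample (all arguments are percentages)."""
    if user + system > 80:   # assumed CPU-bound limit, taken from the vmstat notes
        return "cpu-bound"
    if iowait > 25:
        return "io-bound"
    return "ok"


if __name__ == "__main__":
    print(classify_iostat(user=20, system=10, iowait=40))
```

An "io-bound" result only says where to look next: paging activity, disk balance, fragmentation, or access patterns.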

 

iostat -d hdisk0 hdisk1 2 10

Disks: shows the names of physical volumes

% tm_act: % of time the physical disk was active. As disk use increases, performance decreases and response time increases. Generally, % tm_act < 40 on a normal system. Moving data from busy to idle drives may help alleviate a bottleneck.

Kbps: amount of data transferred (read or written) to the drive, in KB per second.

To see if a SCSI adaptor is saturated, add up the Kbps amounts for all disks attached to the adaptor.

tps: transfers per second that were issued to the physical disk.

Kb_read: total data (in KB) read from the physical volume during the interval.

Kb_wrtn: total data (in KB) written to the physical volume during the interval.

A high disk busy rate with a low disk transfer rate may indicate fragmented logical volumes, file systems, or individual files.

An average physical volume utilization > 25% across all disks indicates an I/O bottleneck.

For maximum performance, the total Kbps should stay below the SCSI adaptor's throughput rating. In most cases, use 70% of the rated throughput as the limit.
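The adaptor saturation check is a sum and a comparison. The Python sketch below uses a hypothetical function name and assumes the adaptor rating has been converted to the same unit as the Kbps column (KB per second); the 70% default follows the guideline above.

```python
def adapter_load(kbps_per_disk, adapter_rating_kbps, usable_fraction=0.70):
    """Return (total Kbps, saturated?) for the disks on one adaptor.

    kbps_per_disk maps disk names to their Kbps values from iostat -d.
    """
    total = sum(kbps_per_disk.values())
    limit = adapter_rating_kbps * usable_fraction
    return total, total > limit


if __name__ == "__main__":
    disks = {"hdisk0": 3000, "hdisk1": 5000}
    print(adapter_load(disks, adapter_rating_kbps=10000))
```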

 

ps: used to see what resources programs are using

 

ps au: information on user processes

%CPU = (process CPU time / process duration) * 100

It is normal to see a process called kproc using CPU time.

 

ps v: most comprehensive report on memory-related statistics

PID: the process ID

TTY: the controlling workstation for the process

STAT: the state of the process (A means active)

TIME: total execution time for the process

PGIN: number of page-ins caused by page faults

SIZE: size of working segment that has been touched (in 1 KB units)

RSS: sum of the size of working segment and code segment in memory (in 1 KB units)

LIM: the soft limit on memory (xx means no limit has been set)

TSIZ: size of the text (shared-program) image

TRS: size of the resident-set (real memory) of text

%CPU: percentage of time the process has used the CPU since the process started

%MEM: percentage of real memory used by the process. RSS / memory size (in KB) * 100 (rounded to the nearest percentage)

COMMAND: the name of the command
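The %CPU and %MEM formulas above work out as follows. This Python sketch uses hypothetical function names; the inputs are assumed to be CPU seconds, elapsed seconds, and sizes in KB.

```python
def pct_cpu(cpu_seconds, elapsed_seconds):
    """%CPU = (process CPU time / process duration) * 100."""
    return 100 * cpu_seconds / elapsed_seconds


def pct_mem(rss_kb, real_memory_kb):
    """%MEM = RSS / memory size (in KB) * 100, rounded to a whole percent."""
    return round(100 * rss_kb / real_memory_kb)


if __name__ == "__main__":
    # 30 s of CPU over a 10-minute lifetime; 8 MB resident on a 128 MB system
    print(pct_cpu(30, 600), pct_mem(8192, 131072))
```

Because %CPU is averaged over the whole life of the process, a long-running process that has only recently become busy can still show a low value.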

 

____________________________________________________________

Useful scripts:

ps -e|head -n1;ps -e|egrep -v "TIME|0:"|sort +2b -3 -n -r|head -n 10

This shows the top 10 processes that have accumulated the most CPU time

 

ps -ef|head -n 1;ps -ef|egrep -v "C|0:00| 0 "|sort +3b -4 -n -r|head -n 10

This shows the top 10 processes that have the most recent CPU usage

 

ps gu|head -n 1;ps gu|egrep -v "CPU|kproc"|sort +2b -3 -n -r|head -n 10

This shows the top 10 processes that have the most CPU usage

____________________________________________________________

 

netstat: displays the contents of various network-related data structures for active connections. Count values are summarized since system startup.

 

netstat -i: shows the state of all configured interfaces.

Name: the name of the interface

Mtu: maximum transmission unit. This is the maximum size of packets (in bytes) that are transmitted using the interface.

Ipkts: total number of packets received.

Ierrs: total number of input errors.

Opkts: total number of packets transmitted.

Oerrs: total number of output errors.

Coll: number of packet collisions detected.

 

netstat -i -Z: this clears all of the statistic counters for the netstat -i command to 0.

 

netstat -I <interface> <interval>: displays statistics for the specified interface.

 

netstat -m: displays statistics recorded by the "mbuf" memory-management routines. If it shows requests for mbufs or clusters being denied, you may want to increase the value of thewall by using:

no -o thewall=newvalue

 

netstat -v: this displays the statistics for each Common Data Link Interface (CDLI)-based device driver that is up.

Here are the most important statistics for each interface:

 

TOKEN-RING STATISTICS (tok0):

(Transmit Statistics)

Transmit Errors: unsuccessful transmissions due to hardware / network errors

Max Packets on S/W Transmit Queue: maximum number of outgoing packets ever queued to the software transmit queue. The queue is too small if this maximum equals the current queue size.

S/W Transmit Queue Overflow: the number of outgoing packets that have overflowed the software transmit queue. A value other than 0 means the queue size should be increased.

(Receive Statistics)

Receive Errors: unsuccessful receptions due to hardware / network errors

Broadcast Packets: number of broadcast packets received without any error. This should be less than 20% of the total received packets.

(General Statistics)

No mbuf Errors: the number of times mbufs were not available to the device driver

 

ETHERNET STATISTICS (ent1):

(Transmit Statistics)

Transmit Errors: unsuccessful transmissions due to hardware / network errors

Max Packets on S/W Transmit Queue: maximum number of outgoing packets ever queued to the software transmit queue. The queue is too small if this maximum equals the current queue size.

S/W Transmit Queue Overflow: the number of outgoing packets that have overflowed the software transmit queue. A value other than 0 means the queue size should be increased.

Max Collision Errors: the number of unsuccessful transmissions due to too many collisions. The number of collisions exceeded the number of retries on the adaptor.

Late Collision Errors: the number of unsuccessful transmissions due to the late collision error.

Timeout Errors: the number of unsuccessful transmissions due to timeout errors reported by the adaptor.

Single Collision Count: the number of outgoing packets with only one collision encountered during transmission.

Multiple Collision Count: the number of outgoing packets with multiple (2-15) collisions encountered during transmission.

(Receive Statistics)

Receive Errors: unsuccessful receptions due to hardware / network errors.

Broadcast Packets: number of broadcast packets received without any error. This should be less than 20% of the total received packets.

Receive Collision Errors: the number of incoming packets with collision errors during reception.

(General Statistics)

No mbuf Errors: the number of times mbufs were not available to the device driver.

 

netstat -p <protocol>: this shows statistics about a specific protocol (udp, tcp, ip, icmp).

Here are some fields to look for:

 

0 fragments dropped (dup or out of space)

0 bad header checksums: this could indicate either a network is corrupting packets or device driver receive queues are not large enough.

0 fragments dropped after timeout: could be due to a lack of mbufs, or to fragments timing out before all fragments of the datagram have arrived (increasing ipfragttl with the no command may avoid this).

0 bad checksums: could happen due to hardware or cable failure

0 dropped due to full socket buffers: could be due to insufficient transmit and receive UDP sockets, too few nfs daemons, and/or too small nfs_socketsize, udp_recvspace and sb_max values.

 

RULES OF THUMB

(from netstat -i)

Ierrs > 0.01 * Ipkts: execute netstat -m to check for lack of memory.

Oerrs > 0.01 * Opkts: the queue size (xmt_que_size) should be increased.

Check the size with lsattr -El <adaptor>

 

To check for an overloaded network calculate from netstat -v:

(Max Collision Errors + Timeout Errors) / Transmit Packets

If the answer is > 5%, then the network should be reorganized to balance the load.

Number of Collisions / Number of Packets > 0.1: indicates a high network load.
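The network rules of thumb above can be checked together once the counters are in hand. This is a Python sketch with hypothetical function names; it assumes "Number of Packets" in the collision rule means output packets (Opkts), which these notes do not state explicitly.

```python
def interface_warnings(ipkts, ierrs, opkts, oerrs, coll):
    """Apply the netstat -i rules of thumb to one interface."""
    warnings = []
    if ierrs > 0.01 * ipkts:
        warnings.append("input errors above 1%: run netstat -m to check memory")
    if oerrs > 0.01 * opkts:
        warnings.append("output errors above 1%: increase xmt_que_size")
    if opkts > 0 and coll / opkts > 0.1:   # assumption: packets = Opkts
        warnings.append("collisions above 10% of packets: high network load")
    return warnings


def network_overloaded(max_collision_errors, timeout_errors, transmit_packets):
    """(Max Collision Errors + Timeout Errors) / Transmit Packets > 5%."""
    if transmit_packets == 0:
        return False
    return (max_collision_errors + timeout_errors) / transmit_packets > 0.05


if __name__ == "__main__":
    print(interface_warnings(100000, 2000, 100000, 10, 500))
    print(network_overloaded(300, 300, 10000))
```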